Abstract: We consider the problem of estimating a deterministic
sparse vector $\mathbf{x}_0$ from underdetermined measurements
$\mathbf{A}\mathbf{x}_0 + \mathbf{w}$, where $\mathbf{w}$ represents white Gaussian noise and $\mathbf{A}$ is
a given deterministic dictionary. We analyze the performance 
of three sparse estimation algorithms: basis pursuit denoising 
(BPDN), orthogonal matching pursuit (OMP), and thresholding. 
These algorithms are shown to achieve near-oracle performance 
with high probability, assuming that $\mathbf{x}_0$ is sufficiently sparse. Our
results are non-asymptotic and are based only on the coherence 
of A, so that they are applicable to arbitrary dictionaries. 
Differences in the precise conditions required for the performance 
guarantees of each algorithm are manifested in the observed 
performance at high and low signal-to-noise ratios. This provides 
insight on the advantages and drawbacks of $\ell_1$ relaxation
techniques such as BPDN as opposed to greedy approaches such 
as OMP and thresholding. 

EDICS Topics: SSP-PARE, SSP-PERF. 
Index terms: Sparse estimation, basis pursuit, matching 
pursuit, thresholding algorithm, oracle. 



I. Introduction 

Estimation problems with sparsity constraints have attracted 
considerable attention in recent years because of their poten- 
tial use in numerous signal processing applications, such as 
denoising, compression and sampling. In a typical setup, an 
unknown deterministic parameter $\mathbf{x}_0 \in \mathbb{R}^m$ is to be estimated
from measurements $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$, where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is
a deterministic matrix and $\mathbf{w}$ is a noise vector. Typically,
the dictionary $\mathbf{A}$ consists of more columns than rows (i.e.,
$m > n$), so that without further assumptions, $\mathbf{x}_0$ is unidentifiable
from $\mathbf{b}$. The impasse is resolved by assuming that
the parameter vector is sparse, i.e., that most elements of $\mathbf{x}_0$
are zero. Under the assumption of sparsity, several estimation 
approaches can be used. These include greedy algorithms, such 
as thresholding and orthogonal matching pursuit (OMP) [1], 
and $\ell_1$ relaxation methods, such as the Dantzig selector [2]
and basis pursuit denoising (BPDN) [3] (also known as the 
Lasso). A comparative analysis of these techniques is crucial 
for determining the appropriate strategy in a given situation. 

There are two standard approaches to modeling the noise w 
in the sparse estimation problem. The first is to assume that
$\mathbf{w}$ is deterministic and bounded [4]-[6]. This leads to a worst-case
analysis in which an estimator must perform adequately
even when the noise maximally damages the measurements. 
The noise in this case is thus called adversarial. By contrast, if 
one assumes that the noise is random, then the analysis aims 
to describe estimator behavior for typical noise values [2], [7], 
[8]. The random noise scenario is the main focus of this paper. 
As one might expect, stronger performance guarantees can be 
obtained in this setting. 

It is common to judge the quality of an estimator by
comparing its mean-squared error (MSE) with the Cramér-Rao
bound (CRB) [9]. In the case of sparse estimation under
Gaussian noise, it has recently been shown that the unbiased
CRB is identical (for almost all values of $\mathbf{x}_0$) to the MSE of the
"oracle" estimator, which knows the locations of the nonzero
elements of $\mathbf{x}_0$ [10]. Thus, a gold standard for estimator
performance is the MSE of the oracle. Indeed, it can be shown
that $\ell_1$ relaxation algorithms come close to the oracle when
the noise is Gaussian. Results of this type are sometimes
referred to as "oracle inequalities." Specifically, Candès and
Tao [2] have shown that, with high probability, the $\ell_2$ distance
between $\mathbf{x}_0$ and the Dantzig estimate is within a constant times
$\log m$ of the performance of the oracle. Recently, Bickel et
al. [8] have demonstrated that the performance of BPDN is
similarly bounded, with high probability, by $C \log m$ times the
oracle performance, for a constant $C$. However, the constant
involved in this analysis is considerably larger than that of the
Dantzig selector. Interestingly, it turns out that the $\log m$ gap
between the oracle and practical estimators is an unavoidable
consequence of the fact that the nonzero locations in $\mathbf{x}_0$ are
unknown [11].

The contributions [2], [8] state their results using the 
restricted isometry constants (RICs). These measures of the
dictionary quality can be efficiently approximated in specific
cases, e.g., when the dictionary is selected randomly from an
appropriate ensemble. However, in general it is NP-hard to
evaluate the RICs for a given matrix $\mathbf{A}$, and they must then
be bounded by efficiently computable properties of $\mathbf{A}$, such
as the mutual coherence [12]. In this respect, coherence-based
results are appealing since they can be used with arbitrary 
dictionaries [13], [14]. 

In this paper, we seek performance guarantees for sparse 
estimators based directly on the mutual coherence of the matrix
$\mathbf{A}$ [15]. While such results are suboptimal when the RICs
of $\mathbf{A}$ are known, the proposed approach yields tighter bounds
than those obtained by applying coherence bounds to RIC-based
results. Specifically, we demonstrate that BPDN, OMP
and thresholding all achieve performance within a constant
times $\log m$ of the oracle estimator, under suitable conditions.
In the case of BPDN, our result provides a tighter guarantee
than the coherence-based implications of the work of Bickel
et al. [8]. To the best of our knowledge, there are no prior
performance guarantees for greedy approaches such as OMP 
and thresholding when the noise is random. 

It is important to distinguish the present work from Bayesian 
performance analysis, as practiced in [13], [16]-[18], where on 
top of the assumption of stochastic noise, a probabilistic model
for $\mathbf{x}_0$ is also used. Our results hold for any specific value of $\mathbf{x}_0$
(satisfying appropriate conditions), rather than providing results
on average over realizations of $\mathbf{x}_0$; this necessarily leads
to weaker guarantees. It also bears repeating that our results 
apply to a fixed, finite-sized matrix A; this distinguishes our 
work from asymptotic performance guarantees for large m and 
n, such as [19]. 

The rest of this paper is organized as follows. We begin 
in Section II by comparing dictionary quality measures and
reviewing standard estimation techniques. In Section III, we
analyze the limitations of estimator performance under ad- 
versarial noise. This motivates the introduction of random 
noise, for which substantially better guarantees are obtained 
in Section IV. Finally, the validity of these results is examined 
by simulation in practical estimation scenarios in Section V. 

The following notation is used throughout the paper. Vectors 
and matrices are denoted, respectively, by boldface lowercase 
and boldface uppercase letters. The set of indices of the 
nonzero entries of a vector x is called the support of x and 
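denoted $\mathrm{supp}(\mathbf{x})$. Given an index set $\Lambda$ and a matrix $\mathbf{A}$, the
notation $\mathbf{A}_\Lambda$ refers to the submatrix formed from the columns
of $\mathbf{A}$ indexed by $\Lambda$. The $\ell_p$ norm of a vector $\mathbf{x}$, for $1 \le p \le \infty$,
is denoted $\|\mathbf{x}\|_p$, while $\|\mathbf{x}\|_0$ denotes the number of nonzero
elements in $\mathbf{x}$.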

II. Preliminaries

A. Characterizing the Dictionary 

Let $\mathbf{x}_0 \in \mathbb{R}^m$ be an unknown deterministic vector, and
denote its support set by $\Lambda_0 = \mathrm{supp}(\mathbf{x}_0)$. Let $s = \|\mathbf{x}_0\|_0$
be the number of nonzero entries in $\mathbf{x}_0$. In our setting, it
is typically assumed that $s$ is much smaller than $m$, i.e.,
that most elements in $\mathbf{x}_0$ are zero. Suppose we obtain noisy
measurements

\[ \mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w} \tag{1} \]

where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a known overcomplete dictionary
($m > n$). We refer to the columns $\mathbf{a}_i$ of $\mathbf{A}$ as the atoms
of the dictionary, and assume throughout our work that the
atoms are normalized, $\|\mathbf{a}_i\|_2 = 1$. We will consider primarily
the situation in which the noise w is random, though for 
comparison we will also examine the case of a bounded 
deterministic noise vector; a precise definition of w is deferred 
to subsequent sections. 

For $\mathbf{x}_0$ to be identifiable, one must guarantee that different
values of $\mathbf{x}_0$ produce significantly different values of $\mathbf{b}$. One
way to ensure this is to examine all possible subdictionaries, or 
s-element sets of atoms, and verify that the subspaces spanned 
by these subdictionaries differ substantially from one another. 

More specifically, several methods have been proposed to
formalize the notion of the suitability of a dictionary for
sparse estimation. These include the mutual coherence [12],
the cumulative coherence [7], the exact recovery coefficient
(ERC) [7], the spark [4], and the RICs [2], [5]. Except for the
mutual coherence and cumulative coherence, none of these
measures can be efficiently calculated for an arbitrary given
dictionary $\mathbf{A}$. Since the values of the cumulative and mutual
coherence are quite close, our focus in this paper will be on
the mutual coherence $\mu = \mu(\mathbf{A})$, which is defined as

\[ \mu \triangleq \max_{i \neq j} \left| \mathbf{a}_i^T \mathbf{a}_j \right|. \tag{2} \]

While the mutual coherence can be efficiently calculated 
directly from (2), it is not immediately clear in what way $\mu$
is related to the requirement that subdictionaries must span
different subspaces. Indeed, $\mu$ ensures a lack of correlation
between single atoms, while we require a distinction between 
s-element subdictionaries. To explore this relation, let us recall 
the definitions of the RICs, which are more directly related to 
the subdictionaries of A. We will then show that the mutual 
coherence can be used to bound the constants involved in the 
RICs, a fact which will also prove useful in our subsequent 
analysis. This strategy is inspired by earlier works, which have
used the mutual coherence to bound the ERC [7] and the spark 
[4]. Thus, the coherence can be viewed as a tractable proxy for 
more accurate measures of the quality of a dictionary, which 
cannot themselves be calculated efficiently. 
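
As an illustration of how directly (2) can be evaluated, the following minimal sketch computes the mutual coherence from the Gram matrix of the normalized atoms. The function name `mutual_coherence` and the small demonstration dictionary `A_demo` are assumptions made for this example and do not appear in the original work.

```python
import numpy as np

def mutual_coherence(A):
    """Largest absolute inner product between two distinct normalized atoms."""
    A = np.asarray(A, dtype=float)
    # Normalize the columns so that every atom has unit l2 norm, as assumed in the text.
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    G = np.abs(A.T @ A)        # absolute values of the Gram matrix entries
    np.fill_diagonal(G, 0.0)   # exclude the i == j terms from the maximum
    return G.max()

# Hypothetical dictionary: the 2 x 2 identity plus one extra atom.
A_demo = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
mu_demo = mutual_coherence(A_demo)
```

The cost is dominated by one matrix product of size $m \times n$ by $n \times m$, in contrast to the enumeration over index sets that the RICs appear to require.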

By the RICs we refer to two properties describing "good" 
dictionaries, namely, the restricted isometry property (RIP) and 
the restricted orthogonality property (ROP), which we now 
define. A dictionary $\mathbf{A}$ is said to satisfy the RIP [5] of order $s$
with parameter $\delta_s$ if, for every index set $\Lambda$ of size $s$, we have

\[ (1 - \delta_s)\|\mathbf{y}\|_2^2 \le \|\mathbf{A}_\Lambda \mathbf{y}\|_2^2 \le (1 + \delta_s)\|\mathbf{y}\|_2^2 \tag{3} \]

for all $\mathbf{y} \in \mathbb{R}^s$. Thus, when $\delta_s$ is small, the RIP ensures that
any $s$-atom subdictionary is nearly orthogonal, which in turn
implies that any two disjoint $(s/2)$-atom subdictionaries are
well-separated.

Similarly, $\mathbf{A}$ is said to satisfy the ROP [2] of order $(s_1, s_2)$
with parameter $\theta_{s_1,s_2}$ if, for every pair of disjoint index sets
$\Lambda_1$ and $\Lambda_2$ having cardinalities $s_1$ and $s_2$, respectively, we
have

\[ \left| \mathbf{y}_1^T \mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2} \mathbf{y}_2 \right| \le \theta_{s_1,s_2} \|\mathbf{y}_1\|_2 \|\mathbf{y}_2\|_2 \tag{4} \]

for all $\mathbf{y}_1 \in \mathbb{R}^{s_1}$ and for all $\mathbf{y}_2 \in \mathbb{R}^{s_2}$. In words, the ROP
requires any two disjoint subdictionaries containing $s_1$ and
$s_2$ elements, respectively, to be nearly orthogonal to each
other. These two properties are therefore closely related to
the requirement that distinct subdictionaries of $\mathbf{A}$ behave
dissimilarly.

In recent years, it has been demonstrated that various 
practical estimation techniques successfully approximate $\mathbf{x}_0$
from $\mathbf{b}$, if the constants $\delta_s$ and $\theta_{s_1,s_2}$ are sufficiently small
[2], [5], [20]. This occurs, for example, when the entries 
in A are chosen randomly according to an independent, 
identically distributed Gaussian law, as well as in some specific 
deterministic dictionary constructions. 

Unfortunately, in the standard estimation setting, one cannot 
design the system matrix A according to these specific rules. 
In general, if one is given a particular dictionary $\mathbf{A}$, then
there is no known algorithm for efficiently determining its
RICs. Indeed, the very nature of the RICs seems to require
enumerating over an exponential number of index sets in order
to find the "worst" subdictionary. While the mutual coherence
$\mu$ of (2) tends to be far less accurate in capturing the quality
of a dictionary, it is still useful to be able to say something
about the RICs based only on $\mu$. Such a result is given in the
following lemma.

Lemma 1: For any matrix $\mathbf{A}$, the RIP constant $\delta_s$ of (3)
and the ROP constant $\theta_{s_1,s_2}$ of (4) satisfy the bounds

\[ \delta_s \le (s - 1)\mu, \tag{5} \]
\[ \theta_{s_1,s_2} \le \mu\sqrt{s_1 s_2}, \tag{6} \]

where $\mu$ is the mutual coherence (2).

The proof of Lemma 1 can be found in Appendix I. We 
will apply this lemma in Section IV, when examining the 
performance of the Dantzig selector. This tool can also be 
used in conjunction with other results that rely on the RIP 
and ROP. 
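
For very small dictionaries the RICs can still be found by exhaustive enumeration, which allows coherence bounds of the form $\delta_s \le (s-1)\mu$ and $\theta_{s_1,s_2} \le \mu\sqrt{s_1 s_2}$ to be checked numerically. The sketch below does this for a random $6 \times 8$ dictionary; the dictionary, its size, and the helper names `rip_constant` and `rop_constant` are assumptions chosen for illustration, and the enumeration is exponential in $s$, so it is not a practical way to evaluate the RICs.

```python
import itertools
import numpy as np

def mutual_coherence(A):
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

def rip_constant(A, s):
    """Smallest delta_s satisfying (3), found by enumerating all size-s index sets."""
    delta = 0.0
    for idx in itertools.combinations(range(A.shape[1]), s):
        sub = A[:, list(idx)]
        eig = np.linalg.eigvalsh(sub.T @ sub)   # eigenvalues in ascending order
        delta = max(delta, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))
    return delta

def rop_constant(A, s1, s2):
    """Smallest theta satisfying (4), found by enumerating disjoint index-set pairs."""
    theta = 0.0
    cols = range(A.shape[1])
    for i1 in itertools.combinations(cols, s1):
        rest = [j for j in cols if j not in i1]
        for i2 in itertools.combinations(rest, s2):
            cross = A[:, list(i1)].T @ A[:, list(i2)]
            theta = max(theta, np.linalg.norm(cross, 2))   # spectral norm
    return theta

# Hypothetical dictionary with normalized atoms.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))
A /= np.linalg.norm(A, axis=0)

mu = mutual_coherence(A)
delta_3 = rip_constant(A, 3)
theta_12 = rop_constant(A, 1, 2)
```

For $s = 2$ the RIP constant coincides with $\mu$, so the coherence bound is tight in that case; for larger $s$ it is typically loose, which reflects the lower accuracy of $\mu$ as a quality measure.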

B. Estimation Techniques 

To fix notation, we now briefly review several approaches 
for estimating $\mathbf{x}_0$ from noisy measurements $\mathbf{b}$ given by (1).
The two main strategies for efficiently estimating a sparse
vector are $\ell_1$ relaxation and greedy methods. The first of
these involves solving an optimization problem wherein the
nonconvex constraint $\|\mathbf{x}_0\|_0 = s$ is relaxed to a constraint
on the $\ell_1$ norm of the estimated vector $\hat{\mathbf{x}}_0$. Specifically, we
consider the $\ell_1$-penalty version of BPDN, which estimates $\mathbf{x}_0$
as a solution $\hat{\mathbf{x}}_{\mathrm{BP}}$ to the quadratic program

\[ \min_{\mathbf{x}} \; \tfrac{1}{2}\|\mathbf{b} - \mathbf{A}\mathbf{x}\|_2^2 + \gamma\|\mathbf{x}\|_1 \tag{7} \]

for some regularization parameter $\gamma$. We refer to the optimization
problem (7) as BPDN, although it should be noted that
some authors reserve this term for the related optimization
problem

\[ \min_{\mathbf{x}} \; \|\mathbf{x}\|_1 \quad \text{s.t.} \quad \|\mathbf{b} - \mathbf{A}\mathbf{x}\|_2^2 \le \delta \tag{8} \]

where $\delta$ is a given constant.
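
The text does not prescribe a solver for the $\ell_1$-penalized least-squares problem. As one possibility, the sketch below applies the iterative soft-thresholding algorithm (ISTA), a standard proximal-gradient method that is substituted here purely for illustration; the function names, the iteration count, and the test vector are assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bpdn_ista(A, b, gamma, n_iter=500):
    """Approximately minimize 0.5 * ||b - A x||_2^2 + gamma * ||x||_1 by ISTA."""
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)       # gradient of the quadratic term
        x = soft_threshold(x - grad / L, gamma / L)
    return x

# Sanity check on an orthonormal dictionary, where the minimizer is known in
# closed form: each entry of b is soft-thresholded by gamma.
b_demo = np.array([2.0, 0.3, -1.0])
x_demo = bpdn_ista(np.eye(3), b_demo, gamma=0.5)
```

A fixed iteration count is used for brevity; a practical implementation would stop once successive iterates differ by less than a tolerance.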

Another estimator based on the idea of $\ell_1$ relaxation is
the Dantzig selector [2], defined as a solution $\hat{\mathbf{x}}_{\mathrm{DS}}$ to the
optimization problem

\[ \min_{\mathbf{x}} \; \|\mathbf{x}\|_1 \quad \text{s.t.} \quad \|\mathbf{A}^T(\mathbf{b} - \mathbf{A}\mathbf{x})\|_\infty \le \tau \tag{9} \]

where $\tau$ is again a user-selected parameter. The Dantzig
selector, like BPDN, is a convex relaxation method, but rather
than penalizing the $\ell_2$ norm of the residual $\mathbf{b} - \mathbf{A}\mathbf{x}$, the Dantzig
selector ensures that the residual is weakly correlated with all
dictionary atoms.

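Instead of solving an optimization problem, greedy approaches
estimate the support set $\Lambda_0$ from the measurements
$\mathbf{b}$. Once a support set $\Lambda$ is chosen, the parameter vector $\mathbf{x}_0$
can be estimated using least-squares (LS) to obtain

\[ \hat{\mathbf{x}} = \begin{cases} \mathbf{A}_\Lambda^\dagger \mathbf{b} & \text{on the support set } \Lambda, \\ \mathbf{0} & \text{elsewhere.} \end{cases} \tag{10} \]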



Greedy techniques differ in the method by which the support
set is selected. The simplest method is known as the thresholding
algorithm. This technique computes the correlation of the
measured signal $\mathbf{b}$ with each of the atoms and defines $\Lambda$ as
the set of indices of the $s$ atoms having the highest correlation.
Subsequently, the LS technique (10) is applied to obtain the
thresholding estimate $\hat{\mathbf{x}}_{\mathrm{th}}$.
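
The thresholding procedure just described can be sketched in a few lines. The helper names `ls_on_support` and `thresholding_estimate` are hypothetical, and the correlation is taken in absolute value, which is an assumption about how "highest correlation" is to be read.

```python
import numpy as np

def ls_on_support(A, b, support):
    """LS estimate: pseudoinverse solution on the support, zero elsewhere."""
    support = np.asarray(support, dtype=int)
    x = np.zeros(A.shape[1])
    x[support] = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
    return x

def thresholding_estimate(A, b, s):
    """Pick the s atoms most correlated with b, then solve LS on that support."""
    corr = np.abs(A.T @ b)                     # assumed: absolute correlations
    support = np.sort(np.argsort(corr)[-s:])   # indices of the s largest values
    return ls_on_support(A, b, support), support

# Hypothetical toy example with an orthonormal dictionary.
A_demo = np.eye(4)
b_demo = np.array([0.0, 3.0, 0.0, -2.0])
x_th, support_th = thresholding_estimate(A_demo, b_demo, s=2)
```

The entire support is chosen in a single pass over the correlations, which is what makes the method simple but also sensitive to interference between atoms.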

A somewhat more sophisticated greedy algorithm is OMP
[1]. This iterative approach begins by initializing the estimated
support set $\Lambda^0$ to the empty set and setting a residual vector $\mathbf{r}^0$
to $\mathbf{b}$. Subsequently, at each iteration $i = 1, \ldots, s$, the algorithm
finds the single atom which is most highly correlated with
$\mathbf{r}^{i-1}$. The index of this atom, say $k_i$, is added to the support
set, so that $\Lambda^i = \Lambda^{i-1} \cup \{k_i\}$. The estimate $\hat{\mathbf{x}}^i_{\mathrm{OMP}}$ at the
$i$th iteration is then defined by the LS solution (10) using the
support set $\Lambda^i$. Next, the residual is updated using the formula

\[ \mathbf{r}^i = \mathbf{b} - \mathbf{A}\hat{\mathbf{x}}^i_{\mathrm{OMP}}. \tag{11} \]

The residual thus describes the part of $\mathbf{b}$ which has yet to be
accounted for by the estimate. The counter $i$ is now incremented,
and $s$ iterations are performed, after which the OMP
estimate $\hat{\mathbf{x}}_{\mathrm{OMP}}$ is defined as the estimate at the final iteration,
$\hat{\mathbf{x}}^s_{\mathrm{OMP}}$. A well-known property of OMP is that the algorithm
never chooses the same atom twice [4]. Consequently, stopping
after $s$ iterations guarantees that $\|\hat{\mathbf{x}}_{\mathrm{OMP}}\|_0 = s$.
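
The OMP iteration described above can be sketched as follows. The function name `omp` and the small dictionary used in the demonstration are assumptions made for illustration; the sketch performs exactly $s$ iterations and does not guard against a residual that vanishes early, a case that can arise only for noiseless or degenerate inputs.

```python
import numpy as np

def omp(A, b, s):
    """Orthogonal matching pursuit with a fixed number s of iterations."""
    m = A.shape[1]
    support = []
    x = np.zeros(m)
    r = b.copy()                                   # residual r^0 = b
    for _ in range(s):
        k = int(np.argmax(np.abs(A.T @ r)))        # atom most correlated with r
        support.append(k)
        coef = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
        x = np.zeros(m)
        x[support] = coef                          # LS estimate on current support
        r = b - A @ x                              # residual update
    return x, sorted(support)

# Hypothetical example: identity atoms plus one normalized all-ones atom.
A_demo = np.hstack([np.eye(3), np.ones((3, 1)) / np.sqrt(3.0)])
x_true = np.array([3.0, 0.0, -1.0, 0.0])
b_demo = A_demo @ x_true
x_omp, support_omp = omp(A_demo, b_demo, s=2)
```

Because the residual is orthogonal to every atom already selected, the correlation of those atoms with the residual is zero, which is why no atom is selected twice as long as the residual is nonzero.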

Finally, we also mention the so-called oracle estimator, 
which is based both on $\mathbf{b}$ and on the true support set $\Lambda_0$
of $\mathbf{x}_0$; the support set is assumed to have been provided by an
"oracle". The oracle estimator $\hat{\mathbf{x}}_{\mathrm{or}}$ calculates the LS solution
(10) for $\Lambda_0$. In the case of white Gaussian noise, the MSE
obtained using this technique equals that of the CRB [10].
Thus, it makes sense to use the oracle estimator as a gold 
standard against which the performance of practical algorithms 
can be compared. 

III. Performance under Adversarial Noise

In this section, we briefly discuss the case in which the 
noise w is an unknown deterministic vector which satisfies 
$\|\mathbf{w}\|_2 \le \varepsilon$. As we will see, performance guarantees in this
case are rather weak, and indeed no denoising capability can 
be ensured for any known algorithm. In Section IV, we will 
compare this setting with the results which can be obtained 
when w is random. 

Typical "stability" results under adversarial noise guarantee 
that if the mutual coherence of A is sufficiently small, and if 
$\mathbf{x}_0$ is sufficiently sparse, then the distance between $\mathbf{x}_0$ and its
estimate is on the order of the noise magnitude. Such results
can be derived for algorithms including BPDN, OMP, and
thresholding. Consider, for example, the following theorem,
which is based on the work of Tropp [7, §IV-C].¹

Theorem 1 (Tropp): Let $\mathbf{x}_0$ be an unknown deterministic
vector with known sparsity $\|\mathbf{x}_0\|_0 = s$, and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$,
where $\|\mathbf{w}\|_2 \le \varepsilon$. Suppose the mutual coherence $\mu$ of the
dictionary $\mathbf{A}$ satisfies $s < 1/(3\mu)$. Let $\hat{\mathbf{x}}_{\mathrm{BP}}$ denote a solution
of BPDN (7) with regularization parameter $\gamma = 2\varepsilon$. Then, $\hat{\mathbf{x}}_{\mathrm{BP}}$
is unique, the support of $\hat{\mathbf{x}}_{\mathrm{BP}}$ is a subset of the support of $\mathbf{x}_0$,
and

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_\infty \le \left(3 + \sqrt{\tfrac{3}{2}}\right)\varepsilon \approx 4.22\,\varepsilon. \tag{12} \]

¹ Tropp considers only the case in which the entries of $\mathbf{x}_0$ belong to the set
$\{0, \pm 1\}$. However, since the analysis performed in [7, §IV-C] can readily be
applied to the general setting considered here, we omit the proof of Theorem 1.

Results similar to Theorem 1 have also been obtained [4], 
[5], [14], [20] for the related $\ell_1$-error estimation approach
(8), as well as for the OMP algorithm [4]. Furthermore, 
the technique used in the proof for the OMP [4] can also 
be applied to demonstrate a (slightly weaker) performance 
guarantee for the thresholding algorithm. 

In all of the aforementioned results, the only guarantee is
that the distance between $\hat{\mathbf{x}}_{\mathrm{BP}}$ and $\mathbf{x}_0$ is on the order of
the noise magnitude $\varepsilon$. Such results are somewhat disappointing,
because one would expect the knowledge that $\mathbf{x}_0$ is sparse to
assist in denoising; yet Theorem 1 promises only that the $\ell_\infty$
distance between $\hat{\mathbf{x}}_{\mathrm{BP}}$ and $\mathbf{x}_0$ is less than about four times the
maximum noise level. However, the fact that no denoising has
occurred is a consequence of the problem setting itself, rather
than a limitation of the algorithms proposed above. In the
adversarial case, even the oracle estimator can only guarantee
an estimation error on the order of $\varepsilon$. This is because $\mathbf{w}$ can
be chosen so that $\mathbf{w} \in \mathrm{span}(\mathbf{A}_{\Lambda_0})$, in which case projection
onto $\mathrm{span}(\mathbf{A}_{\Lambda_0})$, as performed by the oracle estimator, does
not remove any portion of the noise.
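
The effect of noise lying inside the span of the true atoms can be seen in a small numerical sketch. The dictionary, the support, the noise coefficients, and the value of the noise bound below are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eps = 16, 32, 0.1
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)          # normalized atoms
support = [2, 7, 19]                    # hypothetical true support
A_sup = A[:, support]

# Adversarial noise: a vector inside span(A_sup), scaled to have norm eps.
w = A_sup @ np.array([1.0, -1.0, 0.5])
w *= eps / np.linalg.norm(w)

# Orthogonal projector onto span(A_sup), applied implicitly by the LS step.
P = A_sup @ np.linalg.pinv(A_sup)
adversarial_kept = np.linalg.norm(P @ w) / np.linalg.norm(w)

# For comparison, a random direction keeps only part of its norm under P.
g = rng.standard_normal(n)
random_kept = np.linalg.norm(P @ g) / np.linalg.norm(g)
```

Here `adversarial_kept` equals one up to rounding, so the projection leaves the adversarial noise untouched, whereas a generic noise direction is attenuated.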

In conclusion, results in this adversarial context must take 
into account values of w which are chosen so as to cause 
maximal damage to the estimation algorithm. In many prac- 
tical situations, such a scenario is overly pessimistic. Thus, 
it is interesting to ask what guarantees can be made about 
the performance of practical estimators under the assumption 
of random (and thus non-adversarial) noise. This scenario is 
considered in the next section. 

IV. Performance under Random Noise 

We now turn to the setting in which the noise $\mathbf{w}$ is a
Gaussian random vector with mean $\mathbf{0}$ and covariance $\sigma^2\mathbf{I}$. In
this case, it can be shown [10] that the MSE of any unbiased
estimator of $\mathbf{x}_0$ satisfies the Cramér-Rao bound

\[ \mathrm{MSE}(\hat{\mathbf{x}}) \ge \mathrm{CRB} = \sigma^2\,\mathrm{Tr}\!\left( (\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1} \right) \tag{13} \]

whenever $\|\mathbf{x}_0\|_0 = s$. Interestingly, CRB is also the MSE of
the oracle estimator [2].

It follows from the Gershgorin disc theorem [21] that
all eigenvalues of $\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0}$ are between $1 - (s - 1)\mu$ and
$1 + (s - 1)\mu$. Therefore, for reasonable sparsity levels,
$\mathrm{Tr}((\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1})$ is not much larger than $s$; for example, if
we assume, as in Theorem 1, that $s < 1/(3\mu)$, then CRB
of (13) is no larger than $\tfrac{3}{2}s\sigma^2$. Considering that the mean
power of $\mathbf{w}$ is $n\sigma^2$, it is evident that the oracle estimator has
substantially reduced the noise level. In this section, we will
demonstrate that comparable performance gains are achievable
using practical methods, which do not have access to the
oracle.
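
The Gershgorin argument can be checked numerically on a concrete dictionary. The sketch below uses a $256 \times 512$ identity-plus-Hadamard dictionary, whose coherence is $1/\sqrt{256}$, with a hypothetical support of size $s = 5$; the dimensions and the support are assumptions made for illustration, and the trace computed is the CRB normalized by $\sigma^2$.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n, s = 256, 5
A = np.hstack([np.eye(n), hadamard(n) / np.sqrt(n)])   # unit-norm atoms
mu = 1.0 / np.sqrt(n)                                  # coherence of this dictionary
support = [3, 40, 200, n + 7, n + 130]                 # hypothetical support

gram = A[:, support].T @ A[:, support]
eigs = np.linalg.eigvalsh(gram)
crb_over_sigma2 = np.trace(np.linalg.inv(gram))        # CRB divided by sigma^2
```

Since $s < 1/(3\mu)$ holds here, the eigenvalues lie in $[1 - (s-1)\mu,\, 1 + (s-1)\mu]$ and the normalized CRB falls between $s$ and $\tfrac{3}{2}s$, far below the normalized noise power $n$.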

A. $\ell_1$-Relaxation Approaches

Historically, performance guarantees under random noise 
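were first obtained for the Dantzig selector (9). The result,
due to Candès and Tao [2], is derived using the RICs (3)-(4).
Using the bounds of Lemma 1 yields the following coherence-based
result.

Theorem 2 (Candès and Tao): Let $\mathbf{x}_0$ be an unknown deterministic
vector such that $\|\mathbf{x}_0\|_0 = s$, and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$,
where $\mathbf{w} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$ is a random noise vector. Assume that

\[ s < 1 + \frac{1}{(1 + \sqrt{2})\mu} \tag{14} \]

and consider the Dantzig selector (9) with parameter

\[ \tau = \sigma\sqrt{2(1 + \alpha)\log m} \tag{15} \]

for some constant $\alpha > 0$. Then, with probability exceeding

\[ 1 - \frac{1}{m^{\alpha}\sqrt{\pi \log m}}, \tag{16} \]

the Dantzig selector $\hat{\mathbf{x}}_{\mathrm{DS}}$ satisfies

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{DS}}\|_2^2 \le 2c_1^2(1 + \alpha)s\sigma^2 \log m \tag{17} \]

where

\[ c_1 = \frac{4}{1 - \left((1 + \sqrt{2})s - 1\right)\mu}. \tag{18} \]
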
This theorem is significant because it demonstrates that,
while $\hat{\mathbf{x}}_{\mathrm{DS}}$ does not quite reach the performance of the oracle
estimator, it does come within a constant factor multiplied by
$\log m$, with high probability. Interestingly, the $\log m$ factor
is an unavoidable result of the fact that the locations of the
nonzero elements in $\mathbf{x}_0$ are unknown (see [11, §7.4] and the
references therein).

It is clearly of interest to determine whether results similar 
to Theorem 2 can be obtained for other sparse estimation 
algorithms [22], [23]. In this context, Bickel et al. [8] have 
recently shown that, with high probability, BPDN also comes 
within a factor of $C \log m$ of the oracle performance, for
a constant C. In fact, their analysis is quite versatile, and 
simultaneously provides a result for both the Dantzig selector 
and BPDN. However, the constant C obtained in this BPDN 
guarantee is always larger than 128, often substantially so; 
this is considerably weaker than the result of Theorem 2. 
Furthermore, while the necessary conditions for the results of 
Bickel et al. are not directly comparable with those of Candès
and Tao, an application of Lemma 1 indicates that coherence- 
based conditions stronger than (14) are required for the results 
of Bickel et al. to hold. 

In the following, we obtain a coherence-based performance 
guarantee for BPDN. In particular, we demonstrate that, for 
an appropriate choice of the regularization parameter $\gamma$, the
squared error of the BPDN estimate is bounded, with high
probability, by a small constant times $s\sigma^2 \log(m - s)$, and
that this constant is lower than that of Theorem 2. We begin
by stating the following somewhat more general result, whose
proof is found in Appendix II.

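Theorem 3: Let $\mathbf{x}_0$ be an unknown deterministic vector
with known sparsity $\|\mathbf{x}_0\|_0 = s$, and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$, where
$\mathbf{w} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$ is a random noise vector. Suppose that²

\[ s < \frac{1}{3\mu}. \tag{19} \]

Then, with probability exceeding

\[ \left(1 - (m - s)\exp\!\left(-\frac{\gamma^2}{8\sigma^2}\right)\right)\left(1 - e^{-s/7}\right), \tag{20} \]

the solution $\hat{\mathbf{x}}_{\mathrm{BP}}$ of BPDN (7) is unique and satisfies

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2^2 \le \left(\sqrt{3}\,\sigma + \tfrac{3}{2}\gamma\right)^2 s. \tag{21} \]
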
To compare the results for BPDN and the Dantzig selector,
we now derive from Theorem 3 a result which holds with a
probability on the order of (16). Observe that in order for
(20) to be a high probability, we require $\exp(-\gamma^2/(8\sigma^2))$
to be substantially smaller than $1/(m - s)$. This requirement
can be used to select a value for the regularization parameter
$\gamma$. In particular, one requires $\gamma$ to be at least on the order
of $\sqrt{8\sigma^2 \log(m - s)}$. However, $\gamma$ should not be much larger
than this value, as this will increase the error bound (21). We
propose to use

\[ \gamma = \sqrt{8\sigma^2(1 + \alpha)\log(m - s)} \tag{22} \]

for some fairly small $\alpha > 0$. Substituting this value of $\gamma$ into
Theorem 3 yields the following result.

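Corollary 1: Under the conditions of Theorem 3, let $\hat{\mathbf{x}}_{\mathrm{BP}}$
be a solution of BPDN (7) with $\gamma$ given by (22). Then, with
probability exceeding

\[ \left(1 - \frac{1}{(m - s)^{\alpha}}\right)\left(1 - e^{-s/7}\right), \tag{23} \]

we have

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2^2 \le \left(\sqrt{3} + 3\sqrt{2(1 + \alpha)\log(m - s)}\right)^2 s\sigma^2. \tag{24} \]
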
Let us examine the probability (23) with which Corollary 1 
holds, to verify that it is indeed roughly equal to (16). The 
expression (23) consists of a product of two terms, both of 
which converge to 1 as the problem dimensions increase. The 
right-hand term may seem odd because it appears to favor 
non-sparse signals; however, this is an artifact of the method 
of proof, which requires a sufficient number of nonzero 
coefficients for large number approximations to hold. This 
right-hand term converges to 1 exponentially and therefore 
typically has a negligible effect on the overall probability of 
success; for example, for s > 50 this term is larger than 0.999. 

The left-hand term in (23) tends to 1 polynomially as $m - s$
increases. This is a slightly lower rate than the probability
(16) with which the Dantzig selector bound holds; however,
this difference is compensated for by a correspondingly lower
multiplicative factor of $\log(m - s)$ in the BPDN error bound
(24), as opposed to the $\log m$ factor in the Dantzig selector.
In any case, for both theorems to hold, m must increase much 
more quickly than s, so that these differences are negligible. 
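
The interplay between the regularization parameter and the success probability can be made concrete with a short calculation. The sketch assumes that the probability in Theorem 3 has the product form $\left(1 - (m-s)\exp(-\gamma^2/(8\sigma^2))\right)\left(1 - e^{-s/7}\right)$ and that $\gamma = \sqrt{8\sigma^2(1+\alpha)\log(m-s)}$; the function names and the numerical values of $m$, $s$, $\sigma$, and $\alpha$ are hypothetical.

```python
import math

def bpdn_gamma(sigma, m, s, alpha):
    """Regularization parameter sqrt(8 sigma^2 (1 + alpha) log(m - s))."""
    return math.sqrt(8.0 * sigma ** 2 * (1.0 + alpha) * math.log(m - s))

def success_probability(gamma, sigma, m, s):
    """Assumed product form of the probability lower bound in Theorem 3."""
    left = 1.0 - (m - s) * math.exp(-gamma ** 2 / (8.0 * sigma ** 2))
    right = 1.0 - math.exp(-s / 7.0)
    return left * right

# Hypothetical problem dimensions and noise level.
m, s, sigma, alpha = 1024, 7, 0.1, 0.5
gamma = bpdn_gamma(sigma, m, s, alpha)
prob = success_probability(gamma, sigma, m, s)

# With this gamma the left-hand factor reduces to 1 - (m - s)^(-alpha).
left_closed_form = 1.0 - (m - s) ** (-alpha)
```

Note that the left-hand factor does not depend on $\sigma$ once $\gamma$ is scaled in this way, so the noise level enters only through the error bound itself.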

² As in [7], analogous findings can also be obtained under the weaker
requirement $s < 1/(2\mu)$, but the resulting expressions are somewhat more
involved.

For large $s$ and $m - s$, Corollary 1 ensures that, with
high probability, $\|\hat{\mathbf{x}}_{\mathrm{BP}} - \mathbf{x}_0\|_2^2$ is no larger than a constant
multiplied by $s\sigma^2 \log(m - s)$. Up to a multiplicative constant,
this error bound is essentially identical to the result (17)
for the Dantzig selector. As we have seen, the probabilities
with which these bounds hold are likewise almost identical.
However, the constants involved in the BPDN, as demonstrated
by Corollary 1, are substantially lower than those previously
known for the Dantzig selector. To see this, consider a situation
in which $s = 1/(4\mu)$. In this case, for large $s$, the bound (17)
on the Dantzig selector rapidly converges to

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{DS}}\|_2^2 \le 203.6\,(1 + \alpha)\,s\sigma^2 \log m. \tag{25} \]




By comparison, the performance of BPDN in the same setting,
as bounded by Corollary 1, is

\[ \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2^2 \lesssim 18\,(1 + \alpha)\,s\sigma^2 \log(m - s), \tag{26} \]

which is over 10 times lower. This improvement is not
merely a result of the particular choice of $s$ or $\mu$. Indeed,
the multiplicative factor of 18 which appeared in the BPDN
bound (26) holds for large $s$ with any value of $\mu$, as long as
$s < 1/(3\mu)$; whereas it can be seen from (17)-(18) that the
multiplicative factor of the Dantzig selector is always larger
than 32. Further comparison between these guarantees will be
presented in Section V.
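
The constants quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the Dantzig selector factor is $2c_1^2$ with $c_1 = 4/\left(1 - ((1+\sqrt{2})s - 1)\mu\right)$, and it approximates the large-$s$ regime by taking a very small hypothetical coherence with $s = 1/(4\mu)$.

```python
import math

def dantzig_factor(s, mu):
    """Multiplicative factor 2 * c1^2 in the Dantzig selector bound (assumed form)."""
    c1 = 4.0 / (1.0 - ((1.0 + math.sqrt(2.0)) * s - 1.0) * mu)
    return 2.0 * c1 ** 2

mu = 1e-6                        # hypothetical small coherence (large-s regime)
s = 1.0 / (4.0 * mu)
ds_factor = dantzig_factor(s, mu)
bp_factor = 18.0                 # leading constant (3 * sqrt(2))^2 of the BPDN bound
ratio = ds_factor / bp_factor

# The smallest possible value of c1 is 4, so the factor 2 * c1^2 cannot fall below 32.
ds_factor_floor = 2.0 * 4.0 ** 2
```

The ratio of roughly 11 between the two factors is consistent with the statement that the BPDN bound is over 10 times lower in this setting.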



B. Greedy Approaches 

The performance guarantees obtained for the $\ell_1$-relaxation
techniques required only the assumption that $\mathbf{x}_0$ is sufficiently
sparse. By contrast, for greedy algorithms, successful estimation
can only be guaranteed if one further assumes that all
nonzero components of $\mathbf{x}_0$ are somewhat larger than the noise
level. The reason is that greedy techniques are based on a LS 
solution for an estimated support, an approach whose efficacy 
is poor unless the support is correctly identified. Indeed, 
when using the LS technique (10), even a single incorrectly 
identified support element may cause the entire estimate to 
be severely incorrect. To ensure support recovery, all nonzero 
elements must be large enough to overcome the noise. 

To formalize this notion, denote $\mathbf{x}_0 = (x_{0,1}, \ldots, x_{0,m})^T$
and define

\[ |x_{\min}| = \min_{i \in \Lambda_0} |x_{0,i}|, \qquad |x_{\max}| = \max_{i \in \Lambda_0} |x_{0,i}|. \tag{27} \]




A performance guarantee for both OMP and the thresholding 
algorithm is then given by the following theorem, whose proof 
can be found in Appendix III.

Theorem 4: Let $\mathbf{x}_0$ be an unknown deterministic vector
with known sparsity $\|\mathbf{x}_0\|_0 = s$, and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$, where
$\mathbf{w} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$ is a random noise vector. Suppose that

\[ |x_{\min}| - (2s - 1)\mu|x_{\min}| \ge 2\sigma\sqrt{2(1 + \alpha)\log m} \tag{28} \]

for some constant $\alpha > 0$. Then, with probability at least

\[ 1 - \frac{1}{m^{\alpha}\sqrt{\pi(1 + \alpha)\log m}}, \tag{29} \]

the OMP estimate $\hat{\mathbf{x}}_{\mathrm{OMP}}$ identifies the correct support $\Lambda_0$ of
$\mathbf{x}_0$ and, furthermore, satisfies

\begin{align}
\|\hat{\mathbf{x}}_{\mathrm{OMP}} - \mathbf{x}_0\|_2^2 &\le \frac{2(1 + \alpha)}{(1 - (s - 1)\mu)^2}\, s\sigma^2 \log m \tag{30a} \\
&\le 8(1 + \alpha)\,s\sigma^2 \log m. \tag{30b}
\end{align}

If the stronger condition

\[ |x_{\min}| - (2s - 1)\mu|x_{\max}| \ge 2\sigma\sqrt{2(1 + \alpha)\log m} \tag{31} \]

holds, then with probability exceeding (29), the thresholding
algorithm also correctly identifies $\Lambda_0$ and satisfies (30).

The performance guarantee (30) is better than that provided 
by Theorem 2 and Corollary 1. However, this result comes at 
the expense of requirements on the magnitude of the entries 
of $\mathbf{x}_0$. Our analysis thus suggests that greedy approaches may
outperform $\ell_1$-based methods when the entries of $\mathbf{x}_0$ are large
compared with the noise, but that the greedy approaches will 
deteriorate when the noise level increases. As we will see in 
Section V, simulations also appear to support this conclusion. 

It is interesting to compare the success conditions (28) 
and (31) of the OMP and thresholding algorithms. For given 
problem dimensions, the OMP algorithm requires $|x_{\min}|$, the
smallest nonzero element of $\mathbf{x}_0$, to be larger than a constant
multiple of the noise standard deviation $\sigma$. This is required
in order to ensure that all elements of the support of $\mathbf{x}_0$
will be identified with high probability. The requirement of
the thresholding algorithm is stronger, as befits a simpler
approach: In this case $|x_{\min}|$ must be larger than the noise
standard deviation plus a constant times $|x_{\max}|$. In other
words, one must be able to separate $|x_{\min}|$ from the combined
effect of noise and interference caused by the other nonzero
components of $\mathbf{x}_0$. This results from the thresholding technique,
in which the entire support is identified simultaneously
from the measurements. By comparison, the iterative approach
used by OMP identifies and removes the large elements in $\mathbf{x}_0$
first, thus facilitating the identification of the smaller elements 
in later iterations. 
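
The difference between the two success conditions is easy to explore numerically. The sketch below assumes that the OMP condition reads $|x_{\min}| - (2s-1)\mu|x_{\min}| \ge 2\sigma\sqrt{2(1+\alpha)\log m}$ and that the thresholding condition replaces the second $|x_{\min}|$ by $|x_{\max}|$; the function names and all numerical values are hypothetical.

```python
import math

def noise_term(sigma, alpha, m):
    """Right-hand side shared by both conditions."""
    return 2.0 * sigma * math.sqrt(2.0 * (1.0 + alpha) * math.log(m))

def omp_condition(x_min, mu, s, sigma, alpha, m):
    return x_min - (2 * s - 1) * mu * x_min >= noise_term(sigma, alpha, m)

def thresholding_condition(x_min, x_max, mu, s, sigma, alpha, m):
    return x_min - (2 * s - 1) * mu * x_max >= noise_term(sigma, alpha, m)

# Hypothetical setting: widening the dynamic range |x_max| / |x_min| breaks the
# thresholding condition while leaving the OMP condition unaffected.
mu, s, sigma, alpha, m = 0.01, 5, 0.02, 0.1, 1000
omp_ok = omp_condition(0.5, mu, s, sigma, alpha, m)
th_ok_narrow = thresholding_condition(0.5, 2.0, mu, s, sigma, alpha, m)
th_ok_wide = thresholding_condition(0.5, 5.0, mu, s, sigma, alpha, m)
```

In this example the OMP condition holds regardless of $|x_{\max}|$, whereas the thresholding condition fails once the largest entry is ten times the smallest.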

V. Numerical Results 

In this section, we describe a number of numerical experi- 
ments comparing the performance of various estimators to the 
guarantees of Section IV. Our first experiment measured the 
median estimation error, i.e., the median of the $\ell_2$ distance
between $\mathbf{x}_0$ and its estimate. The median error is intuitively
appealing as it characterizes the "typical" estimation error, and 
it can be readily bounded by the performance guarantees of 
Section IV. 

Specifically, we chose the two-ortho dictionary A = [I H], 
where I is the 512 x 512 identity matrix and H is the 512 x 
512 Hadamard matrix with normalized columns. The RICs of 
this dictionary are unknown, but the coherence can be readily 
calculated and is given by $\mu = 1/\sqrt{512}$. Consequently, the
theorems of Section IV can be used to obtain performance
guarantees for sufficiently sparse vectors. In particular, in our
simulations we chose parameters $\mathbf{x}_0$ having a support of size
$s = 7$. The smallest nonzero entry in $\mathbf{x}_0$ was $|x_{\min}| = 0.1$
and the largest entry was $|x_{\max}| = 1$. Under these conditions,
applying the theorems of Section IV yields the bounds³

\[
\begin{aligned}
\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{OMP}}\|_2^2 &\le 3.7\,s\sigma^2 \log m \quad \text{w.p. } \tfrac{1}{2}, \text{ if } \sigma \le 0.057; \\
\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2^2 &\le 22.1\,s\sigma^2 \log m \quad \text{w.p. } \tfrac{1}{2}; \\
\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{DS}}\|_2^2 &\le 361.8\,s\sigma^2 \log m \quad \text{w.p. } \tfrac{1}{2}.
\end{aligned}
\tag{32}
\]

We have thus obtained guarantees for the median estimation
error of the Dantzig selector, BPDN, and OMP. Under these
settings, no guarantee can be made for the performance
of the thresholding algorithm. Indeed, as we will see, for
some choices of $\mathbf{x}_0$ satisfying the above requirements, the
performance of the thresholding algorithm is not proportional
to $s\sigma^2 \log m$. To obtain thresholding guarantees, one requires
a narrower range between $|x_{\min}|$ and $|x_{\max}|$.

To measure the actual median error obtained by various
estimators, 8 different parameter vectors $\mathbf{x}_0$ were selected.
These differed in the distribution of the magnitudes of the
nonzero components within the range $[|x_{\min}|, |x_{\max}|]$ and in
the locations of the nonzero elements. For each parameter $\mathbf{x}_0$,
a set of measurement vectors $\mathbf{b}$ was obtained from (1) by
adding white Gaussian noise. The estimation algorithms of
Section II-B were then applied to each measurement realization;
for the Dantzig selector and BPDN, the parameters
$\tau$ and $\gamma$ were chosen as the smallest values such that the
probabilities of success (16) and (23), respectively, would
exceed 1/2. The median over noise realizations of the distance
$\|\mathbf{x}_0 - \hat{\mathbf{x}}\|_2$ was then computed for each estimator. This process
was repeated for 10 values of the noise variance $\sigma^2$ in the
range 10~^ < <t^ < 1. The results are plotted in Fig. 1 as a 
function of $\sigma^2$.

It is evident from Fig. 1 that some parameter vectors are
more difficult to estimate than others. Indeed, there is a large
variety of parameters $\mathbf{x}_0$ satisfying the problem requirements,
and it is likely that some of them come closer to the theoretical
limits than the parameters chosen in our experiment.
This highlights the importance of performance guarantees in
ensuring adequate performance for all parameter values. On
the other hand, it is quite possible that the constants in the
performance bounds can be improved further. For
example, the Dantzig selector guarantee, which is obtained by
applying coherence bounds to RIC-based results [2], is almost
100 times higher than the worst of the examined parameter
values. It should also be noted that applying coherence bounds
to RIC-based BPDN guarantees [8] yields a bound which
applies to the aforementioned matrix $\mathbf{A}$ only when $s \le 3$,
and thus cannot be used in the present setting. Therefore, it
appears that when dealing with dictionaries for which only
the coherence $\mu$ is known, guarantees based directly on $\mu$ are
tighter than RIC-based results.

In practice, it is more common to measure the MSE of an
estimator than its median error. Our next goal is to determine
whether the behavior predicted by our theoretical analysis is
also manifested in the MSE of the various estimators. To this

* In the current setting, the results for the Dantzig selector (Theorem 2)
and OMP (Theorem 4) can only be used to yield guarantees holding with
probabilities of approximately 3/4 and higher. These are, of course, also
bounds on the median error.
[Fig. 1 panels: (c) OMP, (d) Thresholding; horizontal axes: noise variance]

Fig. 1. Median estimation error for practical estimators (solid line) compared with the performance guarantees (dashed line) and the oracle estimator (dotted
line). The solid lines report performance for 8 different values of the unknown parameter vector $\mathbf{x}_0$. For OMP, performance is only guaranteed for $\sigma \le 0.057$,
while for thresholding, nothing can be guaranteed for the given problem dimensions.



end, we conducted an experiment in which the MSEs of the
estimators of Section II-B were compared. In this simulation,
we chose the two-ortho dictionary $\mathbf{A} = [\mathbf{I}\ \mathbf{H}]$, where $\mathbf{I}$ is the
256 × 256 identity matrix and $\mathbf{H}$ is the 256 × 256 Hadamard
matrix with normalized columns.† Once again, the RICs of
this dictionary are unknown. However, the coherence in this
case is given by $\mu = 1/16$, and consequently, the $\ell_1$ relaxation
guarantees of Section IV-A hold for $s \le 5$.
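The coherence of a two-ortho dictionary [I H] can be confirmed by brute force. The sketch below is an illustration under stated assumptions: it builds the Hadamard matrix with the Sylvester construction (which requires the dimension to be a power of 2), uses hypothetical helper names, and runs on small sizes because a pure-Python pairwise scan is slow for n = 256. The same pattern $\mu = 1/\sqrt{n}$ gives 1/16 for n = 256.

```python
import math

def two_ortho_atoms(n):
    """Columns of [I H], with H the normalized n x n Sylvester-Hadamard matrix."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    scale = 1.0 / math.sqrt(n)
    identity = [[1.0 if r == c else 0.0 for r in range(n)] for c in range(n)]
    hadamard = [[H[r][c] * scale for r in range(n)] for c in range(n)]
    return identity + hadamard

def mutual_coherence(atoms):
    """Largest |a_i^T a_j| over distinct unit-norm atoms."""
    mu = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            inner = sum(a * b for a, b in zip(atoms[i], atoms[j]))
            mu = max(mu, abs(inner))
    return mu

coherence_16 = mutual_coherence(two_ortho_atoms(16))  # expected 1/sqrt(16)
coherence_64 = mutual_coherence(two_ortho_atoms(64))  # expected 1/sqrt(64)
```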

We obtained the parameter vector $\mathbf{x}_0$ for this experiment
by selecting a 5-element support at random, choosing the
nonzero entries from a white Gaussian distribution, and then
normalizing the resulting vector so that $\|\mathbf{x}_0\|_2 = 1$. The

† Similar experiments were performed on a variety of other dictionaries,
including an overcomplete DCT [24] and a matrix containing Gaussian
random entries. The different dictionaries yielded comparable results, which
are not reported here.



regularization parameters $\tau$ and $\gamma$ of the Dantzig selector
and BPDN were chosen as recommended by Theorem 2 and
Corollary 1, respectively; for both estimators a value of $\alpha = 1$
was chosen, so that the guaranteed probability of success for
the two algorithms has the same order of magnitude. The
MSE of each estimate was then calculated by averaging over
repeated realizations of $\mathbf{x}_0$ and the noise. The experiment was
conducted for 10 values of the noise variance and the results
are plotted in Fig. 2 as a function of the signal-to-noise ratio
(SNR), which is defined by

$$
\mathrm{SNR} = \frac{\|\mathbf{x}_0\|_2^2}{n\sigma^2} = \frac{1}{n\sigma^2}. \tag{33}
$$

To compare this plot with the theoretical results of Sec- 
tion IV, observe first the situation at high SNR. In this case, 
OMP, BPDN, and the Dantzig selector all achieve performance 
Fig. 2. MSE of various estimators as a function of the SNR. The sparsity
level is $s = 5$ and the dictionary is a 256 × 512 two-ortho matrix.

which is proportional to the oracle MSE (or CRB) given by
(13). Among these, OMP is closest to the CRB, followed
by BPDN and, finally, the Dantzig selector. This behavior
matches the proportionality constants given in the theorems
of Section IV. Indeed, for small $\sigma$, the condition (28) holds
even for large $\alpha$, and thus Theorem 4 guarantees that OMP
will recover the correct support of $\mathbf{x}_0$ with high probability,
explaining the convergence of this estimator to the oracle.
By contrast, the performance of the thresholding algorithm
levels off at high SNR; this is again predicted by Theorem 4,
since, even when $\sigma = 0$, the condition (31) does not always
hold, unless $|x_{\min}|$ is not much smaller than $|x_{\max}|$. Thus, for
our choice of $\mathbf{x}_0$, Theorem 4 does not guarantee near-oracle
performance for the thresholding algorithm, even at high SNR.

With increasing noise, Theorem 4 requires a corresponding
increase in $|x_{\min}|$ to guarantee the success of the greedy
algorithms. Consequently, Fig. 2 demonstrates a deterioration
of these algorithms when the SNR is low. On the other
hand, the theorems for the relaxation algorithms make no
such assumptions, and indeed these approaches continue to
perform well, compared with the oracle estimator, even when
the noise level is high. In particular, the Dantzig selector
outperforms the CRB at low SNR; this is because the CRB
is a bound on unbiased techniques, whereas when the noise
is large, biased techniques such as an $\ell_1$ penalty become very
effective. Robustness to noise is thus an important advantage
of $\ell_1$-relaxation techniques.
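The trade-off between bias and noise suppression can be seen in a scalar toy problem. This is only an analogy and not the experiment of this paper: a single unknown $x$ is observed as $y = x + w$, the unbiased estimate is $y$ itself, and the biased estimate is a soft-thresholded $y$ (the scalar form of an $\ell_1$ penalty). All names and parameter values below are assumptions chosen for illustration.

```python
import random

def soft_threshold(y, lam):
    """Shrink y toward zero by lam; values within [-lam, lam] become zero."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0

def compare_mse(x, sigma, lam, trials=20000, seed=0):
    """Monte Carlo MSE of the unbiased estimate y and of its soft-thresholded version."""
    rng = random.Random(seed)
    err_unbiased = 0.0
    err_shrunk = 0.0
    for _ in range(trials):
        y = x + rng.gauss(0.0, sigma)
        err_unbiased += (y - x) ** 2
        err_shrunk += (soft_threshold(y, lam) - x) ** 2
    return err_unbiased / trials, err_shrunk / trials

# Low SNR: small signal, large noise. Shrinkage is expected to help.
low_snr = compare_mse(x=0.2, sigma=1.0, lam=1.0)
# High SNR: large signal, small noise. The shrinkage bias is expected to dominate.
high_snr = compare_mse(x=5.0, sigma=0.1, lam=0.3)
```

The same mechanism explains both observations in Fig. 2: shrinkage costs accuracy when the noise is small and pays off when the noise is large.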

It is also interesting to examine the effect of the support
size $s$ on the performance of the various estimators. To this
end, 15 support sizes in the range $2 \le s \le 30$ were tested. For
each value of $s$, random vectors $\mathbf{x}_0$ having $s$ nonzero entries
were selected as in the previous simulation. The dictionary
$\mathbf{A}$ was the 256 × 512 two-ortho matrix defined above; as in
the previous experiment, other matrices were also tested and
provided similar results. The standard deviation of the noise
for this experiment was $\sigma = 0.01$. The results are plotted in
Fig. 3.




[Fig. 3 plot; horizontal axis: number of nonzero coefficients, s]

Fig. 3. MSE of various estimators as a function of the support size $s$. The
noise standard deviation is $\sigma = 0.01$ and the dictionary is a 256 × 512
two-ortho matrix.

As mentioned above, the mutual coherence of the dictionary
$\mathbf{A}$ is 1/16, so that the proposed performance guarantees
apply only when $\mathbf{x}_0$ is quite sparse ($s \le 5$). Nevertheless,
Fig. 3 demonstrates that the estimation algorithms (with the
exception of the thresholding approach) exhibit a graceful
degradation as the support of $\mathbf{x}_0$ increases. At first sight
this would appear to mean that the performance guarantees
provided are overly pessimistic. For example, it is possible
that the RICs in the present setting, while unknown, are fairly
low and permit a stronger analysis than that of Section IV. It
is also quite reasonable to expect, as mentioned above, that
some improvement in the theoretical guarantees is possible.
However, it is worth recalling that the performance guarantees
proposed in this paper apply to all sparse vectors, while
the numerical results describe the performance averaged over
different values of $\mathbf{x}_0$. Thus it is possible that there exist
particular parameter values for which the performance is
considerably poorer than that reported in Fig. 3. Indeed, there
exist values of $\mathbf{A}$ and $\mathbf{x}_0$ for which BPDN yields grossly
incorrect results even when $\|\mathbf{x}_0\|_0$ is on the order of $1/\mu$ [13].
However, identifying such worst-case parameters numerically
is quite difficult; this is doubtless at least part of the reason
for the apparent pessimism of the performance guarantees.

VI. Conclusion 

The performance of an estimator depends on the problem
setting under consideration. As we have seen, under the adversarial
noise scenario of Section III, the estimation error of any
algorithm can be as high as the noise power; in other words,
the assumption of sparsity has not yielded any denoising
effect. On the other hand, in the Bayesian regime in which
both $\mathbf{x}_0$ and the noise vector are random, practical estimators
come close to the performance of the oracle estimator [13].
In Section IV, we examined a middle ground between these
two extremes, namely the setting in which $\mathbf{x}_0$ is deterministic
but the noise is random. As we have shown, despite the fact
that less is known about $\mathbf{x}_0$ in this case than in the Bayesian
scenario, a variety of estimation techniques are still guaranteed
to achieve performance close to that of the oracle estimator.

Our theoretical and numerical results suggest some conclusions
concerning the choice of an estimator. In particular, at
high SNR values, it appears that the greedy OMP algorithm
has an advantage over the other algorithms considered herein.
In this case the support set of $\mathbf{x}_0$ can be recovered accurately
and OMP thus converges to the oracle estimator; by contrast,
$\ell_1$ relaxations have a shrinkage effect which causes a loss
of accuracy at high SNR. This is of particular interest since
greedy algorithms are also computationally more efficient
than relaxation methods. On the other hand, the $\ell_1$ relaxation
techniques, and particularly the Dantzig selector, appear to
be more effective than the greedy algorithms when the noise
level is significant: in this case, shrinkage is a highly effective
denoising technique. Indeed, as a result of the bias introduced
by the shrinkage, $\ell_1$-based approaches can even perform better
than the oracle estimator and the Cramér-Rao bound.

Appendix I
Proof of Lemma 1 

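By Gershgorin's disc theorem [21], all eigenvalues of
$\mathbf{A}_\Lambda^T \mathbf{A}_\Lambda$ are between $1 - (s-1)\mu$ and $1 + (s-1)\mu$. Combining
this with the fact that, for all $\mathbf{y}$,

$$
\lambda_{\min}(\mathbf{A}_\Lambda^T \mathbf{A}_\Lambda)\,\|\mathbf{y}\|_2^2 \le \|\mathbf{A}_\Lambda \mathbf{y}\|_2^2 \le \lambda_{\max}(\mathbf{A}_\Lambda^T \mathbf{A}_\Lambda)\,\|\mathbf{y}\|_2^2 \tag{34}
$$

we obtain (5). Next, to demonstrate (6), observe that

$$
|\mathbf{y}_1^T \mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2} \mathbf{y}_2| \le |\mathbf{y}_1^T| \cdot |\mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2}| \cdot |\mathbf{y}_2| \tag{35}
$$

where the absolute value of a matrix or vector is taken
elementwise. Since $\mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2}$ is a submatrix of $\mathbf{A}^T \mathbf{A}$ which
does not contain any of the diagonal elements of $\mathbf{A}^T \mathbf{A}$, it
follows that each element in $\mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2}$ is no larger in absolute
value than $\mu$. Thus

$$
|\mathbf{y}_1^T \mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2} \mathbf{y}_2| \le \mu\, |\mathbf{y}_1^T|\, \mathbf{1}\mathbf{1}^T |\mathbf{y}_2| = \mu \|\mathbf{y}_1\|_1 \|\mathbf{y}_2\|_1 \tag{36}
$$

where $\mathbf{1}$ indicates a vector of ones. Using the fact that $\|\mathbf{y}\|_1 \le
\sqrt{s}\,\|\mathbf{y}\|_2$ for any $s$-vector $\mathbf{y}$, we obtain

$$
|\mathbf{y}_1^T \mathbf{A}_{\Lambda_1}^T \mathbf{A}_{\Lambda_2} \mathbf{y}_2| \le \mu \sqrt{s_1 s_2}\, \|\mathbf{y}_1\|_2 \|\mathbf{y}_2\|_2, \tag{37}
$$

which implies that $\theta_{s_1,s_2}$ satisfies (6).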
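The two inequalities established in this proof can be spot-checked numerically. The sketch below is an illustration under assumptions made here: a 16 x 32 two-ortho dictionary [I H] with coherence 1/4, disjoint random index sets of sizes $s_1 = s_2 = 2$, and hypothetical helper names. It records the worst observed deviation of $\|\mathbf{A}_\Lambda \mathbf{y}\|_2^2 / \|\mathbf{y}\|_2^2$ from 1 and the worst normalized cross term from (37).

```python
import math
import random

def two_ortho_atoms(n):
    """Columns of [I H], with H the normalized n x n Sylvester-Hadamard matrix."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    scale = 1.0 / math.sqrt(n)
    identity = [[1.0 if r == c else 0.0 for r in range(n)] for c in range(n)]
    hadamard = [[H[r][c] * scale for r in range(n)] for c in range(n)]
    return identity + hadamard

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(atoms, indices, coeffs):
    """Return A_Lambda y for the atoms in `indices` and coefficients `coeffs`."""
    n = len(atoms[0])
    return [sum(c * atoms[j][r] for j, c in zip(indices, coeffs)) for r in range(n)]

def check_lemma1(n=16, s1=2, s2=2, trials=200, seed=0):
    atoms = two_ortho_atoms(n)
    mu = 1.0 / math.sqrt(n)  # coherence of [I H]
    rng = random.Random(seed)
    worst_isometry = 0.0
    worst_cross = 0.0
    for _ in range(trials):
        idx = rng.sample(range(2 * n), s1 + s2)  # distinct, so the two sets are disjoint
        set1, set2 = idx[:s1], idx[s1:]
        y1 = [rng.gauss(0.0, 1.0) for _ in set1]
        y2 = [rng.gauss(0.0, 1.0) for _ in set2]
        v1 = combine(atoms, set1, y1)
        v2 = combine(atoms, set2, y2)
        worst_isometry = max(worst_isometry, abs(dot(v1, v1) / dot(y1, y1) - 1.0))
        cross = abs(dot(v1, v2)) / math.sqrt(dot(y1, y1) * dot(y2, y2))
        worst_cross = max(worst_cross, cross)
    return mu, worst_isometry, worst_cross

mu, worst_isometry, worst_cross = check_lemma1()
```

A random search of this kind cannot prove the bounds, but any violation would indicate an error in the reconstruction of (34) to (37).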

Appendix II 
Proof of Theorem 3 

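The proof is based closely on the work of Tropp [7]. From
the triangle inequality,

$$
\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2 \le \|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2 + \|\hat{\mathbf{x}}_{\mathrm{or}} - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2 \tag{38}
$$

where $\hat{\mathbf{x}}_{\mathrm{or}}$ is the oracle estimator. Our goal is to separately
bound the two terms on the right-hand side of (38). Indeed,
as we will see, the two constants $\sigma\sqrt{3}$ and $\tfrac{3}{2}\gamma$ in (21) arise,
respectively, from the two terms in (38).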

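Beginning with the term $\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2$, let $\mathbf{x}_{0,\Lambda_0}$ denote the
$s$-vector containing the elements of $\mathbf{x}_0$ indexed by $\Lambda_0$, and
similarly, let $\hat{\mathbf{x}}_{\mathrm{or},\Lambda_0}$ denote the corresponding subvector of $\hat{\mathbf{x}}_{\mathrm{or}}$.
We then have

$$
\begin{aligned}
\mathbf{x}_{0,\Lambda_0} - \hat{\mathbf{x}}_{\mathrm{or},\Lambda_0}
&= \mathbf{x}_{0,\Lambda_0} - \mathbf{A}_{\Lambda_0}^\dagger (\mathbf{A}\mathbf{x}_0 + \mathbf{w}) \\
&= \mathbf{x}_{0,\Lambda_0} - \mathbf{A}_{\Lambda_0}^\dagger (\mathbf{A}_{\Lambda_0}\mathbf{x}_{0,\Lambda_0} + \mathbf{w}) \\
&= -\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}
\end{aligned}
\tag{39}
$$

where we have used the fact that $\mathbf{A}_{\Lambda_0}$ has full column rank,
which is a consequence [25] of the condition (19). Thus,
$\mathbf{x}_{0,\Lambda_0} - \hat{\mathbf{x}}_{\mathrm{or},\Lambda_0}$ is a Gaussian random vector with mean $\mathbf{0}$ and
covariance $\sigma^2 \mathbf{A}_{\Lambda_0}^\dagger \mathbf{A}_{\Lambda_0}^{\dagger T} = \sigma^2 (\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1}$.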

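For future use, we note that the cross-correlation between
$\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}$ and $(\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)\mathbf{w}$ is

$$
E\{\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}\mathbf{w}^T (\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)^T\}
= \sigma^2 \mathbf{A}_{\Lambda_0}^\dagger (\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)^T
= \mathbf{0}, \tag{40}
$$

where we have used the fact [26, Th. 1.2.1] that for any matrix
$\mathbf{M}$

$$
\mathbf{M}^\dagger \mathbf{M}^{\dagger T} \mathbf{M}^T = (\mathbf{M}^T \mathbf{M})^\dagger \mathbf{M}^T = \mathbf{M}^\dagger. \tag{41}
$$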
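As a worked step, the vanishing of the cross-correlation in (40) follows from (41) with $\mathbf{M} = \mathbf{A}_{\Lambda_0}$. The expansion below uses only the symbols already defined in this appendix and is offered as a reading aid:

```latex
\begin{aligned}
\mathbf{A}_{\Lambda_0}^\dagger \left(\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger\right)^T
&= \mathbf{A}_{\Lambda_0}^\dagger
 - \mathbf{A}_{\Lambda_0}^\dagger \mathbf{A}_{\Lambda_0}^{\dagger T} \mathbf{A}_{\Lambda_0}^T \\
&= \mathbf{A}_{\Lambda_0}^\dagger - \mathbf{A}_{\Lambda_0}^\dagger \\
&= \mathbf{0}.
\end{aligned}
```

The first line uses $(\mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)^T = \mathbf{A}_{\Lambda_0}^{\dagger T}\mathbf{A}_{\Lambda_0}^T$, and the second applies (41).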

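Since $\mathbf{w}$ is Gaussian, it follows that $\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}$ and
$(\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)\mathbf{w}$ are statistically independent. Furthermore,
because $\mathbf{x}_{0,\Lambda_0} - \hat{\mathbf{x}}_{\mathrm{or},\Lambda_0}$ depends on $\mathbf{w}$ only through $\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}$, we
conclude that

$$
\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}} \text{ is statistically independent of } (\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)\mathbf{w}. \tag{42}
$$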

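We now wish to bound the probability that $\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 >
3s\sigma^2$. Let $\mathbf{z}$ be a normalized Gaussian random variable, $\mathbf{z} \sim
N(\mathbf{0}, \mathbf{I}_s)$. Then

$$
\begin{aligned}
\Pr\{\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 > 3s\sigma^2\}
&= \Pr\left\{\left\|\sigma (\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1/2} \mathbf{z}\right\|_2^2 > 3s\sigma^2\right\} \\
&\le \Pr\left\{\left\|(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1/2}\right\|^2 \|\mathbf{z}\|_2^2 > 3s\right\}
\end{aligned}
\tag{43}
$$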

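where $\|\mathbf{M}\|$ denotes the maximum singular value of the matrix
$\mathbf{M}$. Thus, $\|(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1/2}\| = 1/s_{\min}$, where $s_{\min}$ is the
minimum singular value of $\mathbf{A}_{\Lambda_0}$. From the Gershgorin disc
theorem [21, p. 320], it follows that $s_{\min} \ge \sqrt{1 - (s-1)\mu}$.
Using (19), this can be simplified to $s_{\min} \ge \sqrt{2/3}$, and
therefore

$$
\left\|(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1/2}\right\| \le \sqrt{\tfrac{3}{2}}. \tag{44}
$$

Combining with (43) yields

$$
\Pr\{\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 > 3s\sigma^2\} \le \Pr\{\|\mathbf{z}\|_2^2 > 2s\}. \tag{45}
$$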

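Observe that $\|\mathbf{z}\|_2^2$ is the sum of the squares of $s$ independent
normalized Gaussian random variables. The right-hand side of (45) is
therefore $1 - F_{\chi_s^2}(2s)$, where $F_{\chi_s^2}(\cdot)$ is the cumulative distribution
function of the $\chi^2$ distribution with $s$ degrees of freedom.
Using the formula [27, §16.3] for $F_{\chi_s^2}(\cdot)$, we have

$$
\Pr\{\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 > 3s\sigma^2\} \le Q(\tfrac{1}{2}s, s) \tag{46}
$$

where $Q(a, z)$ is the regularized Gamma function

$$
Q(a, z) = \frac{\int_z^\infty t^{a-1} e^{-t}\, dt}{\int_0^\infty t^{a-1} e^{-t}\, dt}. \tag{47}
$$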



$Q(\tfrac{1}{2}s, s)$ decays exponentially as $s \to \infty$, and it can be seen
that

$$
Q(\tfrac{1}{2}s, s) \le e^{-s/7} \quad \text{for all } s. \tag{48}
$$
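The bound (48) can be checked numerically for a range of $s$. The sketch below assumes that (48) reads $Q(\tfrac{1}{2}s, s) \le e^{-s/7}$ and evaluates the regularized Gamma function for half-integer first arguments with the standard recurrence $Q(a+1, z) = Q(a, z) + z^a e^{-z}/\Gamma(a+1)$, starting from $Q(\tfrac{1}{2}, z) = \mathrm{erfc}(\sqrt{z})$ or $Q(1, z) = e^{-z}$. The function name is hypothetical.

```python
import math

def chi2_tail_at_2s(s):
    """Pr{chi^2_s > 2s} = Q(s/2, s) for a positive integer s."""
    z = float(s)
    if s % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z) = erfc(sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)             # Q(1, z) = exp(-z)
    while a < s / 2.0:
        # Recurrence step: add z^a * exp(-z) / Gamma(a + 1), computed in log space.
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Largest ratio of the tail probability to exp(-s/7) over s = 1..200.
worst_ratio = max(chi2_tail_at_2s(s) / math.exp(-s / 7.0) for s in range(1, 201))
```

A Chernoff argument gives $\Pr\{\chi_s^2 > 2s\} \le \exp(-s(1-\ln 2)/2)$, and since $(1-\ln 2)/2 \approx 0.153 > 1/7$, the numerical check is consistent with a bound of this form holding for every $s$.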



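We thus conclude that the event

$$
\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 \le 3s\sigma^2 \tag{49}
$$

occurs with probability no smaller than $1 - e^{-s/7}$. Note that
the same technique can be applied to obtain bounds on the
probability that $\|\mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{or}}\|_2^2 > \alpha s\sigma^2$, for any $\alpha > \tfrac{3}{2}$. The
only difference will be the rate of exponential decay in (48).
However, the distance between $\mathbf{x}_0$ and $\hat{\mathbf{x}}_{\mathrm{or}}$ is usually small
compared with the distance between $\hat{\mathbf{x}}_{\mathrm{or}}$ and $\hat{\mathbf{x}}_{\mathrm{BP}}$, so that such
an approach does not significantly affect the overall result.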

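The above calculations provided a bound on the first term
in (38). To address the second term $\|\hat{\mathbf{x}}_{\mathrm{or}} - \hat{\mathbf{x}}_{\mathrm{BP}}\|_2$, define the
random event

$$
G : \max_{1 \le i \le m} \left|\mathbf{a}_i^T (\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)\mathbf{b}\right| \le \frac{\gamma}{2} \tag{50}
$$

where $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$. It is shown in [7, App. IV-A]
that

$$
\Pr\{G\} \ge 1 - (m - s)\exp\left(-\frac{\gamma^2}{8\sigma^2}\right). \tag{51}
$$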

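If $G$ indeed occurs, then the portion of the measurements $\mathbf{b}$
which does not belong to the range space of $\mathbf{A}_{\Lambda_0}$ is small,
and consequently it has been shown [7, Cor. 9] that, in this
case, the solution $\hat{\mathbf{x}}_{\mathrm{BP}}$ to (7) is unique, the support of $\hat{\mathbf{x}}_{\mathrm{BP}}$ is
a subset of $\Lambda_0$, and

$$
\|\hat{\mathbf{x}}_{\mathrm{BP}} - \hat{\mathbf{x}}_{\mathrm{or}}\|_\infty \le \tfrac{3}{2}\gamma. \tag{52}
$$

Since both $\hat{\mathbf{x}}_{\mathrm{BP}}$ and $\hat{\mathbf{x}}_{\mathrm{or}}$ are nonzero only in $\Lambda_0$, this implies
that

$$
\|\hat{\mathbf{x}}_{\mathrm{BP}} - \hat{\mathbf{x}}_{\mathrm{or}}\|_2 \le \tfrac{3}{2}\gamma\sqrt{s}. \tag{53}
$$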

The event $G$ depends on the random variable $\mathbf{w}$ only
through $(\mathbf{I} - \mathbf{A}_{\Lambda_0}\mathbf{A}_{\Lambda_0}^\dagger)\mathbf{w}$. Thus, it follows from (42) that $G$
is statistically independent of the event (43). The probability
that both events occur simultaneously is therefore given by the
product of their respective probabilities. In other words, with
probability exceeding (20), both (53) and (49) hold. Using (38)
completes the proof of the theorem.

Appendix III 
Proof of Theorem 4 

The claims concerning both algorithms are closely related.
To emphasize this similarity, we first provide several lemmas
which will be used to prove both results. These lemmas are
all based on an analysis of the random event

$$
B = \left\{ \max_{1 \le i \le m} |\mathbf{a}_i^T \mathbf{w}| < \tau \right\} \tag{54}
$$

where

$$
\tau \triangleq \sigma\sqrt{2(1 + \alpha)\log m} \tag{55}
$$

and $\alpha > 0$. Our proof will be based on demonstrating that
$B$ occurs with high probability, and that when $B$ does occur,
both thresholding and OMP achieve near-oracle performance.



Lemma 2: Suppose that $\mathbf{w} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$. Then, the event
$B$ of (54) occurs with a probability of at least (29).

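Proof: The random variables $\{\mathbf{a}_i^T \mathbf{w}\}_{i=1}^m$ are jointly
Gaussian. Therefore, by Šidák's lemma [28, Th. 1]

$$
\Pr\{B\} = \Pr\left\{ \max_{1 \le i \le m} |\mathbf{a}_i^T \mathbf{w}| < \tau \right\} \ge \prod_{i=1}^m \Pr\{|\mathbf{a}_i^T \mathbf{w}| < \tau\}. \tag{56}
$$

Since $\|\mathbf{a}_i\|_2 = 1$, each random variable $\mathbf{a}_i^T \mathbf{w}$ has mean zero
and variance $\sigma^2$. Consequently,

$$
\Pr\{|\mathbf{a}_i^T \mathbf{w}| < \tau\} = 1 - 2Q\left(\frac{\tau}{\sigma}\right) \tag{57}
$$

where $Q(x) = (1/\sqrt{2\pi}) \int_x^\infty e^{-z^2/2}\, dz$ is the Gaussian tail
probability. Using the bound

$$
Q(x) \le \frac{1}{x\sqrt{2\pi}}\, e^{-x^2/2} \tag{58}
$$

we obtain from (57)

$$
\Pr\{|\mathbf{a}_i^T \mathbf{w}| < \tau\} \ge 1 - \eta \tag{59}
$$

where

$$
\eta \triangleq \sqrt{\frac{2}{\pi}}\, \frac{\sigma}{\tau}\, e^{-\tau^2/(2\sigma^2)}. \tag{60}
$$

When $\eta > 1$, the bound (29) is meaningless and the theorem
holds vacuously. Otherwise, when $\eta \le 1$, we have from (56)
and (59)

$$
\Pr\{B\} \ge (1 - \eta)^m \ge 1 - m\eta \tag{61}
$$

where the final inequality holds for any $\eta \le 1$ and $m \ge 1$.
Substituting the values of $\eta$ and $\tau$ and simplifying, we obtain
that $B$ holds with a probability no lower than (29), as required.

■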
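The chain of bounds in this proof can be evaluated numerically. The sketch below is an illustration with hypothetical names; the closed-form expression in it is what one obtains by substituting (55) and (60) into $1 - m\eta$, and it is assumed here to coincide with (29). The values $m = 1024$ and $\alpha = 0.5$ are arbitrary choices.

```python
import math

def lemma2_quantities(m, alpha, sigma=1.0):
    """Evaluate the quantities appearing in (57)-(61) for given m and alpha."""
    tau = sigma * math.sqrt(2.0 * (1.0 + alpha) * math.log(m))
    # Exact single-atom probability: 1 - 2 Q(tau / sigma) = erf(tau / (sigma sqrt(2))).
    exact_single = math.erf(tau / (sigma * math.sqrt(2.0)))
    eta = math.sqrt(2.0 / math.pi) * (sigma / tau) * math.exp(-tau ** 2 / (2.0 * sigma ** 2))
    product_bound = (1.0 - eta) ** m   # middle term of (61)
    linear_bound = 1.0 - m * eta       # right-hand side of (61)
    closed_form = 1.0 - 1.0 / (m ** alpha * math.sqrt(math.pi * (1.0 + alpha) * math.log(m)))
    return exact_single, eta, product_bound, linear_bound, closed_form

exact_single, eta, product_bound, linear_bound, closed_form = lemma2_quantities(1024, 0.5)
```

For $m = 1024$, letting $\alpha$ approach zero in the closed form gives a probability of roughly 0.79, in line with the remark that the OMP guarantee of Section V holds with probability of approximately 3/4.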

The next lemma demonstrates that, under suitable conditions,
correlating $\mathbf{b}$ with the dictionary atoms $\mathbf{a}_i$ is an effective
method of identifying the atoms participating in the support
of $\mathbf{x}_0$.

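Lemma 3: Let $\mathbf{x}_0$ be a vector with support $\Lambda_0 = \mathrm{supp}(\mathbf{x}_0)$
of size $s = |\Lambda_0|$, and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$ for some noise vector
$\mathbf{w}$. Define $|x_{\min}|$ and $|x_{\max}|$ as in (27), and suppose that

$$
|x_{\max}| - (2s - 1)\mu|x_{\max}| \ge 2\tau. \tag{62}
$$

Then, if the event $B$ of (54) holds, we have

$$
\max_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}| > \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|. \tag{63}
$$

If, rather than (62), the stronger condition

$$
|x_{\min}| - (2s - 1)\mu|x_{\max}| \ge 2\tau \tag{64}
$$

is given, then, under the event $B$, we have

$$
\min_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}| > \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|. \tag{65}
$$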

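Proof: The proof is an adaptation of [4, Lemma 5.2].
Beginning with the term $\max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|$, we have, under the
event $B$,

$$
\begin{aligned}
\max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|
&= \max_{j \notin \Lambda_0} \left| \mathbf{a}_j^T \mathbf{w} + \sum_{i \in \Lambda_0} x_i \mathbf{a}_j^T \mathbf{a}_i \right| \\
&\le \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{w}| + \max_{j \notin \Lambda_0} \sum_{i \in \Lambda_0} |x_i \mathbf{a}_j^T \mathbf{a}_i| \\
&< \tau + s\mu|x_{\max}|.
\end{aligned}
\tag{66}
$$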
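On the other hand, when $B$ holds,

$$
\begin{aligned}
\max_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|
&= \max_{j \in \Lambda_0} \left| x_j + \mathbf{a}_j^T \mathbf{w} + \sum_{i \in \Lambda_0 \setminus \{j\}} x_i \mathbf{a}_j^T \mathbf{a}_i \right| \\
&\ge |x_{\max}| - \max_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{w}| - \max_{j \in \Lambda_0} \sum_{i \in \Lambda_0 \setminus \{j\}} |x_i \mathbf{a}_j^T \mathbf{a}_i| \\
&> |x_{\max}| - \tau - (s - 1)\mu|x_{\max}| \\
&= |x_{\max}| - (2s - 1)\mu|x_{\max}| - \tau + s\mu|x_{\max}|.
\end{aligned}
\tag{67}
$$

Together with (66), this yields

$$
\max_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}| > |x_{\max}| - (2s - 1)\mu|x_{\max}| - 2\tau + \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|. \tag{68}
$$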

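Thus, under the condition (62), we obtain (63). Similarly, when
$B$ holds, we have

$$
\begin{aligned}
\min_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|
&= \min_{j \in \Lambda_0} \left| x_j + \mathbf{a}_j^T \mathbf{w} + \sum_{i \in \Lambda_0 \setminus \{j\}} x_i \mathbf{a}_j^T \mathbf{a}_i \right| \\
&> |x_{\min}| - \tau - (s - 1)\mu|x_{\max}| \\
&= |x_{\min}| - (2s - 1)\mu|x_{\max}| - \tau + s\mu|x_{\max}|.
\end{aligned}
\tag{69}
$$

Again using (66), we obtain

$$
\min_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{b}| > |x_{\min}| - (2s - 1)\mu|x_{\max}| - 2\tau + \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{b}|. \tag{70}
$$

Consequently, under the assumption (64), we conclude that
(65) holds, as required. ■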
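The separation claimed in (65) can be observed on a concrete instance. The sketch below is an illustration under assumptions made here: a 64 x 128 two-ortho dictionary [I H] with coherence 1/8, a hypothetical 2-sparse vector with entries 1.0 and -0.8, the largest threshold $\tau$ permitted by (64), and a noise vector rescaled so that the event $B$ of (54) holds by construction. All helper names are hypothetical.

```python
import math
import random

def two_ortho_atoms(n):
    """Columns of [I H], with H the normalized n x n Sylvester-Hadamard matrix."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    scale = 1.0 / math.sqrt(n)
    identity = [[1.0 if r == c else 0.0 for r in range(n)] for c in range(n)]
    hadamard = [[H[r][c] * scale for r in range(n)] for c in range(n)]
    return identity + hadamard

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def check_lemma3(n=64, seed=1):
    atoms = two_ortho_atoms(n)
    mu = 1.0 / math.sqrt(n)
    entries = {3: 1.0, n + 10: -0.8}  # hypothetical support: one identity atom, one Hadamard atom
    s = len(entries)
    x_min = min(abs(v) for v in entries.values())
    x_max = max(abs(v) for v in entries.values())
    tau = (x_min - (2 * s - 1) * mu * x_max) / 2.0  # (64) holds with equality
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    peak = max(abs(dot(a, w)) for a in atoms)
    w = [0.99 * tau * wi / peak for wi in w]  # now max_i |a_i^T w| < tau, i.e. event B
    b = [sum(c * atoms[j][r] for j, c in entries.items()) + w[r] for r in range(n)]
    corr = [abs(dot(a, b)) for a in atoms]
    on_support = min(corr[j] for j in entries)
    off_support = max(corr[j] for j in range(2 * n) if j not in entries)
    return tau, on_support, off_support

tau, on_support, off_support = check_lemma3()
```

With these values the margin between the smallest on-support correlation and the largest off-support correlation is small, which reflects how tight condition (64) is when it holds with equality.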

The following lemma bounds the performance of the oracle
estimator under the event $B$. The usefulness of this lemma
stems from the fact that, if either OMP or the thresholding
algorithm correctly identifies the support of $\mathbf{x}_0$, then its
estimate is identical to that of the oracle.

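Lemma 4: Let $\mathbf{x}_0$ be a vector with support $\Lambda_0 = \mathrm{supp}(\mathbf{x}_0)$,
and let $\mathbf{b} = \mathbf{A}\mathbf{x}_0 + \mathbf{w}$ for some noise vector $\mathbf{w}$. If the event
$B$ of (54) occurs, then

$$
\|\hat{\mathbf{x}}_{\mathrm{or}} - \mathbf{x}_0\|_2^2 \le 2s\sigma^2(1 + \alpha)\log m \cdot \frac{1}{(1 - (s - 1)\mu)^2}. \tag{71}
$$

Proof: Note that both $\hat{\mathbf{x}}_{\mathrm{or}}$ and $\mathbf{x}_0$ are supported on $\Lambda_0$,
and therefore

$$
\|\hat{\mathbf{x}}_{\mathrm{or}} - \mathbf{x}_0\|_2^2 = \|\mathbf{A}_{\Lambda_0}^\dagger \mathbf{b} - \mathbf{x}_{0,\Lambda_0}\|_2^2 \tag{72}
$$

where $\mathbf{x}_{0,\Lambda_0}$ is the subvector of nonzero entries of $\mathbf{x}_0$. We thus
have, under the event $B$,

$$
\begin{aligned}
\|\hat{\mathbf{x}}_{\mathrm{or}} - \mathbf{x}_0\|_2^2
&= \|\mathbf{A}_{\Lambda_0}^\dagger \mathbf{A}_{\Lambda_0} \mathbf{x}_{0,\Lambda_0} + \mathbf{A}_{\Lambda_0}^\dagger \mathbf{w} - \mathbf{x}_{0,\Lambda_0}\|_2^2 \\
&= \|\mathbf{A}_{\Lambda_0}^\dagger \mathbf{w}\|_2^2 \\
&= \|(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1} \mathbf{A}_{\Lambda_0}^T \mathbf{w}\|_2^2 \\
&\le \|(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0})^{-1}\|^2 \sum_{i \in \Lambda_0} (\mathbf{a}_i^T \mathbf{w})^2 \\
&\le \frac{1}{(1 - (s - 1)\mu)^2}\, s \cdot 2\sigma^2(1 + \alpha)\log m
\end{aligned}
\tag{73}
$$

where, in the last step, we used the definition (54) of $B$ and the
fact that $\lambda_{\min}(\mathbf{A}_{\Lambda_0}^T \mathbf{A}_{\Lambda_0}) \ge 1 - (s - 1)\mu$, which was demonstrated
in Appendix II. This completes the proof of the lemma. ■

We are now ready to prove Theorem 4. The proof for the
thresholding algorithm is obtained by combining the three
lemmas presented above. Indeed, Lemma 2 ensures that the
event $B$ occurs with probability at least as high as the required
probability of success (29). Whenever $B$ occurs, we have by
Lemma 3 that the atoms corresponding to $\Lambda_0$ all have strictly
higher correlation with $\mathbf{b}$ than the off-support atoms, so that
the thresholding algorithm identifies the correct support $\Lambda_0$,
and is thus equivalent to the oracle estimator $\hat{\mathbf{x}}_{\mathrm{or}}$ as long as $B$
holds. Finally, by Lemma 4, identification of the true support
$\Lambda_0$ guarantees the required error (30).

We now prove the OMP performance guarantee. Our aim
is to show that when $B$ occurs, OMP correctly identifies the
support of $\mathbf{x}_0$; the result then follows by Lemmas 2 and 4.
To this end we employ the technique used in the proof of
[4, Th. 5.1]. We begin by examining the first iteration of the
OMP algorithm, in which one identifies the atom $\mathbf{a}_j$ whose
correlation with $\mathbf{b}$ is maximal. Note that (28) implies (62), and
therefore, by Lemma 3, the atom having the highest correlation
with $\mathbf{b}$ corresponds to an element in the support $\Lambda_0$ of $\mathbf{x}_0$.
Consequently, the first step of the OMP algorithm correctly
identifies an element in $\Lambda_0$.

The proof now continues by induction. Suppose we are
currently in the $i$th iteration of OMP, with $1 < i \le s$, and
assume that atoms from the correct support were identified
in all $i - 1$ previous steps. Referring to the notation used
in the definition of OMP in Section II-B, this implies that
$\mathrm{supp}(\hat{\mathbf{x}}_{\mathrm{OMP}}^{i-1}) = \Lambda^{i-1} \subset \Lambda_0$. The $i$th step consists of identifying
the atom which is maximally correlated with the
residual $\mathbf{r}^i$. By the definition of $\mathbf{r}^i$, we have

$$
\mathbf{r}^i = \mathbf{A}\tilde{\mathbf{x}}^{i-1} + \mathbf{w} \tag{74}
$$

where $\tilde{\mathbf{x}}^{i-1} = \mathbf{x}_0 - \hat{\mathbf{x}}_{\mathrm{OMP}}^{i-1}$. Thus $\mathrm{supp}(\tilde{\mathbf{x}}^{i-1}) \subseteq \Lambda_0$, so that
$\mathbf{r}^i$ is a noisy measurement of the vector $\mathbf{A}\tilde{\mathbf{x}}^{i-1}$, which has
a sparse representation consisting of no more than $s$ atoms.
Now, since

$$
\|\hat{\mathbf{x}}_{\mathrm{OMP}}^{i-1}\|_0 = i - 1 < s = \|\mathbf{x}_0\|_0, \tag{75}
$$

it follows that at least one nonzero entry in $\tilde{\mathbf{x}}^{i-1}$ is equal to
the corresponding entry in $\mathbf{x}_0$. Consequently

$$
\max_j |\tilde{x}_j^{i-1}| \ge |x_{\min}|. \tag{76}
$$

Note that the model (74) is precisely of the form (1), with
$\mathbf{r}^i$ taking the place of the measurements $\mathbf{b}$ and $\tilde{\mathbf{x}}^{i-1}$ taking the
place of the sparse vector $\mathbf{x}_0$. It follows from (76) and (28)
that this model satisfies the requirement (62). Consequently,
by Lemma 3, we have that under the event $B$,

$$
\max_{j \in \Lambda_0} |\mathbf{a}_j^T \mathbf{r}^i| > \max_{j \notin \Lambda_0} |\mathbf{a}_j^T \mathbf{r}^i|.
$$

Therefore, the $i$th iteration of OMP will choose an element
within $\Lambda_0$ to add to the support. By induction it follows that
the first $s$ steps of OMP all identify elements in $\Lambda_0$, and since
OMP never chooses the same element twice, the entire support
$\Lambda_0$ will be identified after $s$ iterations. This completes the
proof of Theorem 4.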